Body Fat Prediction


1 | Importing Libraries and Loading the Dataset¶

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import xgboost as xgb
colors = ['#ffcd94', '#eac086', '#ffad60', '#ffe39f', '#ffd700', '#ff8c00', '#ff6347', '#deb887', '#f4a460', '#cd853f']
sns.set_palette(sns.color_palette(colors))
In [5]:
df = pd.read_csv(r"C:\Users\HP\Downloads\bodyfat .csv")
df.head()
Out[5]:
Density BodyFat Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist
0 1.0708 12.3 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 1.0853 6.1 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 1.0414 25.3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 1.0751 10.4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 1.0340 28.7 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

👉 | About the dataset

Dataset Overview: Body Fat Measurements¶

Context:¶

This dataset provides estimates of body fat percentage determined through underwater weighing, alongside various body circumference measurements for 252 men. The goal is to develop predictive models for estimating body fat based on simpler and less invasive measurements.

Educational Use:¶

This dataset is ideal for demonstrating multiple regression techniques. Since accurately measuring body fat through underwater weighing is both inconvenient and costly, it helps illustrate how to estimate body fat using more accessible body circumference measurements.

Measurement Standards: Measurements follow the standards outlined in Behnke and Wilmore (1974), pages 45-48. For instance, the abdomen 2 circumference is measured laterally at the iliac crests and anteriorly at the umbilicus.

Application:¶

These data are used to produce predictive equations for lean body weight as discussed in the abstract "Generalized Body Composition Prediction Equation for Men Using Simple Measurement Techniques" by Penrose, Nelson, and Fisher, published in Medicine and Science in Sports and Exercise, vol. 17, no. 2, April 1985, p. 189. The predictive equations were developed from the first 143 of the 252 cases provided in this dataset.
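Underwater weighing yields body density, which the Density column records; body fat percentage is conventionally derived from density with Siri's equation, BF% = 495/Density − 450. As a quick sanity check against the first rows of the table above (a sketch, not one of the notebook's original cells):

```python
def siri_body_fat(density: float) -> float:
    """Estimate body fat percentage from body density (g/cm^3) via Siri's equation."""
    return 495.0 / density - 450.0

# Row 0 has Density = 1.0708 and BodyFat = 12.3; row 2 has 1.0414 and 25.3
print(round(siri_body_fat(1.0708), 1))  # ≈ 12.3
print(round(siri_body_fat(1.0414), 1))  # ≈ 25.3
```

The near-perfect agreement explains the strong negative correlation between Density and BodyFat seen later in the analysis.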

2 | Understanding Our Data¶

👉 | Shape

In [16]:
#What is the shape of the dataset?
df.shape
Out[16]:
(252, 15)

👉 | Information

In [4]:
# Extract information about DataFrame
df_info = pd.DataFrame({
    'Non-Null Count': df.notnull().sum(),
    'Data Type': df.dtypes
})

# Apply stylish formatting with custom colors
styled_df_info = (
    df_info.style
    .set_properties(**{
        'background-color': 'black',  # Background color for the entire table
        'color': '#eac086',  # Text color
        'border': '1px solid black',  # Border color
        'padding': '8px'  # Padding for cells
    })
    .set_caption('DataFrame Information: Attributes and Data Types')  # Add a title to the table
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#eac086')]},  # Heading background color
    ])
)

# Display the styled DataFrame
styled_df_info
Out[4]:
DataFrame Information: Attributes and Data Types
  Non-Null Count Data Type
Density 252 float64
BodyFat 252 float64
Age 252 int64
Weight 252 float64
Height 252 float64
Neck 252 float64
Chest 252 float64
Abdomen 252 float64
Hip 252 float64
Thigh 252 float64
Knee 252 float64
Ankle 252 float64
Biceps 252 float64
Forearm 252 float64
Wrist 252 float64
In [19]:
#Some analysis on the numerical columns
df.describe()
Out[19]:
Density BodyFat Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist
count 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000 252.000000
mean 1.055574 19.150794 44.884921 178.924405 70.148810 37.992063 100.824206 92.555952 99.904762 59.405952 38.590476 23.102381 32.273413 28.663889 18.229762
std 0.019031 8.368740 12.602040 29.389160 3.662856 2.430913 8.430476 10.783077 7.164058 5.249952 2.411805 1.694893 3.021274 2.020691 0.933585
min 0.995000 0.000000 22.000000 118.500000 29.500000 31.100000 79.300000 69.400000 85.000000 47.200000 33.000000 19.100000 24.800000 21.000000 15.800000
25% 1.041400 12.475000 35.750000 159.000000 68.250000 36.400000 94.350000 84.575000 95.500000 56.000000 36.975000 22.000000 30.200000 27.300000 17.600000
50% 1.054900 19.200000 43.000000 176.500000 70.000000 38.000000 99.650000 90.950000 99.300000 59.000000 38.500000 22.800000 32.050000 28.700000 18.300000
75% 1.070400 25.300000 54.000000 197.000000 72.250000 39.425000 105.375000 99.325000 103.525000 62.350000 39.925000 24.000000 34.325000 30.000000 18.800000
max 1.108900 47.500000 81.000000 363.150000 77.750000 51.200000 136.200000 148.100000 147.700000 87.300000 49.100000 33.900000 45.000000 34.900000 21.400000

👉 | Null Values Handling

In [90]:
# Calculate the number of null values and their percentages
null_counts = df.isnull().sum()
total_rows = len(df)
null_percentages = (null_counts / total_rows) * 100

# Create a DataFrame to display the counts and percentages
null_summary = pd.DataFrame({
    'Null Values': null_counts,
    'Percentage': null_percentages
})

# Apply stylish formatting with custom colors
styled_null_summary = (
    null_summary.style
    .format({'Percentage': '{:.2f}%'})  # Format percentage to two decimal places
    .background_gradient(cmap='coolwarm', subset=['Percentage'])  # Apply a gradient to the 'Percentage' column
    .highlight_max(subset=['Null Values'], color='lightcoral')  # Highlight the row with the maximum null values
    .set_caption('Summary of Null Values and Their Percentages')  # Add a title to the table
    .set_table_styles([
        {'selector': 'thead th', 'props': [('background-color', '#eac086'),  # Header background color
                                           ('color', 'black'),  # Header text color
                                           ('font-weight', 'bold')]},
        {'selector': 'tbody tr:hover', 'props': [('background-color', '#eac086')]},  # Hover effect with background color
        {'selector': 'tbody td', 'props': [('background-color', 'black'),  # Table body background color
                                           ('color', '#eac086'),  # Table body text color
                                           ('border', '1px solid #eac086'),  # Border color
                                           ('padding', '8px')]}
    ])
)

# Display the styled DataFrame
styled_null_summary
Out[90]:
Summary of Null Values and Their Percentages
  Null Values Percentage
Density 0 0.00%
BodyFat 0 0.00%
Age 0 0.00%
Weight 0 0.00%
Height 0 0.00%
Neck 0 0.00%
Chest 0 0.00%
Abdomen 0 0.00%
Hip 0 0.00%
Thigh 0 0.00%
Knee 0 0.00%
Ankle 0 0.00%
Biceps 0 0.00%
Forearm 0 0.00%
Wrist 0 0.00%

Great, we have no null values in the dataset!

Great, we have no duplicate values in the dataset!
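The duplicate claim is easy to verify with pandas; on the notebook's data it would be `df.duplicated().sum()`. A self-contained illustration on a tiny hypothetical frame:

```python
import pandas as pd

# Illustrative frame; in the notebook this check would run on df itself
sample = pd.DataFrame({'Age': [23, 22, 23], 'Weight': [154.25, 173.25, 154.25]})

# duplicated() marks rows identical to an earlier row; sum() counts them
n_duplicates = sample.duplicated().sum()
print(n_duplicates)  # the third row repeats the first, so 1
```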

3 | Exploratory Data Analysis¶

👉 | Plotting The Features

In [34]:
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as stats

warnings.filterwarnings('ignore')

# Customize the color palette
colors = ['#eac086', '#ffcd94', '#e57373']

# Create subplots
fig, ax = plt.subplots(15, 3, figsize=(30, 90))

for index, column in enumerate(df.select_dtypes(include='number').columns):
    # Distribution Plot with KDE
    sns.histplot(df[column], kde=True, color=colors[0], alpha=0.9, ax=ax[index, 0], bins=30, edgecolor='black')
    ax[index, 0].set_title(f'Distribution Plot of {column}', fontsize=14, weight='bold')
    ax[index, 0].set_xlabel(column, fontsize=12)
    ax[index, 0].set_ylabel('Frequency', fontsize=12)
    ax[index, 0].grid(True)

    # Boxplot
    sns.boxplot(x=df[column], ax=ax[index, 1], color=colors[1], saturation=0.9)
    ax[index, 1].set_title(f'Box Plot of {column}', fontsize=14, weight='bold')
    ax[index, 1].set_xlabel(column, fontsize=12)
    ax[index, 1].grid(True)

    # Q-Q Plot for Normality Check
    stats.probplot(df[column].dropna(), plot=ax[index, 2])
    ax[index, 2].get_lines()[1].set_color(colors[2])  # Line of the Q-Q plot
    ax[index, 2].get_lines()[1].set_linewidth(2)
    ax[index, 2].set_title(f'Q-Q Plot of {column}', fontsize=14, weight='bold')
    ax[index, 2].grid(True)

# Improve overall layout and add a main title
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.subplots_adjust(top=0.95, hspace=0.4)
plt.suptitle("Visualizing Continuous Columns", fontsize=50, weight='bold', color='black')

plt.show()

Observations¶

  • The dataset contains some outliers.
  • Several columns, such as Height, Ankle, and Age, are skewed.

Ankle, Hip, Weight, and Height are the most heavily skewed columns.

👉 | Handling Skewness

In [6]:
# Step 1: Calculate Skewness Before Transformation
from scipy.stats import boxcox, skew  # needed below for Box-Cox and skewness checks

skewness_before = df.skew(axis=0).sort_values()
skewness_df_before = pd.DataFrame(skewness_before, columns=['Skewness Before'])

# Step 2: Apply Transformations and Store the Transformed Features
transformed_features = {}
skewness_after = []

for col in df.select_dtypes(include='number').columns:
    # Check if column values are all positive (required for Box-Cox transformation)
    if (df[col] > 0).all():
        transformed_data, fitted_lambda = boxcox(df[col].dropna())
        transformed_features[col] = transformed_data  # Store transformed data
        skewness_after.append({'Column': col, 'Skewness After': skew(transformed_data)})
    else:
        # Apply log transformation for non-positive values (handle zeros by adding a small constant)
        transformed_data = np.log1p(df[col] - df[col].min() + 1)
        transformed_features[col] = transformed_data
        skewness_after.append({'Column': col, 'Skewness After': skew(transformed_data)})

# Convert skewness after transformation to DataFrame
skewness_df_after = pd.DataFrame(skewness_after).set_index('Column')

# Step 3: Use Pandas Styling to Display Skewness Tables Before and After Transformation
styled_skewness_before = (
    skewness_df_before.style
    .background_gradient(cmap='coolwarm')
    .set_properties(**{
        'background-color': '#eac086',  # Set custom background color
        'color': 'black',  # Set text color to black
        'border': '1px solid black',  # Border color
        'padding': '8px'  # Padding for better readability
    })
    .set_caption('------- Column Skewness Before Transformation ------')
)

styled_skewness_after = (
    skewness_df_after.style
    .background_gradient(cmap='coolwarm')
    .set_properties(**{
        'background-color': '#eac086',  # Set custom background color
        'color': 'black',  # Set text color to black
        'border': '1px solid black',  # Border color
        'padding': '8px'  # Padding for better readability
    })
    .set_caption('---- Column Skewness After Transformation ---')
)

# Display the styled DataFrames
display(styled_skewness_before)
------- Column Skewness Before Transformation ------
  Skewness Before
Height -5.384987
Forearm -0.219333
Density -0.020176
BodyFat 0.146353
Wrist 0.281614
Age 0.283521
Biceps 0.285530
Knee 0.516744
Neck 0.552620
Chest 0.681556
Thigh 0.821210
Abdomen 0.838418
Weight 1.205263
Hip 1.497127
Ankle 2.255134
In [91]:
display(styled_skewness_after)
---- Column Skewness After Transformation ---
  Skewness After
Column  
Density -0.002732
BodyFat -1.184503
Age -0.028675
Weight -0.012174
Height 0.160706
Neck -0.016034
Chest -0.005164
Abdomen -0.004243
Hip -0.045212
Thigh -0.015753
Knee -0.005937
Ankle -0.113418
Biceps 0.000303
Forearm 0.028126
Wrist -0.001051
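The Box-Cox step can be seen in isolation: on strongly right-skewed data, the maximum-likelihood fitted transform pulls the skewness toward zero. A self-contained sketch on synthetic data (not one of the notebook's own cells):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=1000)  # right-skewed and strictly positive

# boxcox requires strictly positive input; it returns the transformed
# data together with the lambda that maximizes the log-likelihood
transformed, fitted_lambda = boxcox(data)

print(round(skew(data), 2), round(skew(transformed), 2))
```

The same positivity requirement is why the notebook's loop falls back to a shifted log transform for columns that contain non-positive values.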

👉 | New Transformed DataFrame

In [8]:
# Step 4: Using Transformed Features for Further Analysis
# Create a new DataFrame with transformed features
df2 = pd.DataFrame(transformed_features)
df2.columns = [f"{col}_transformed" for col in df2.columns]
In [7]:
df2.head()
Out[7]:
Density_transformed BodyFat_transformed Age_transformed Weight_transformed Height_transformed Neck_transformed Chest_transformed Abdomen_transformed Hip_transformed Thigh_transformed Knee_transformed Ankle_transformed Biceps_transformed Forearm_transformed Wrist_transformed
0 0.071748 2.660260 7.566358 1.748828 5.309691e+09 1.334254 0.671926 1.056564 0.384054 0.843417 0.749983 0.296298 3.995944 245.395629 1.232805
1 0.086672 2.091864 7.356642 1.756517 7.673162e+09 1.339406 0.671932 1.056142 0.384054 0.843375 0.749983 0.296299 3.932654 270.859227 1.241166
2 0.041726 3.306887 7.356642 1.748717 4.670870e+09 1.328781 0.671959 1.057053 0.384054 0.843500 0.750324 0.296300 3.857367 210.142920 1.228695
3 0.076166 2.517696 8.169442 1.760570 7.673162e+09 1.337009 0.672025 1.056785 0.384054 0.843568 0.749983 0.296299 4.012360 279.602364 1.241166
4 0.034220 3.424263 7.771548 1.760402 7.084643e+09 1.329820 0.671976 1.058933 0.384054 0.843964 0.750934 0.296300 4.004175 250.396172 1.237475
In [92]:
# Extract information about DataFrame
df_info = pd.DataFrame({
    'Non-Null Count': df2.notnull().sum(),
    'Data Type': df2.dtypes
})

# Apply stylish formatting with custom colors
styled_df_info = (
    df_info.style
    .set_properties(**{
        'background-color': 'black',  # Background color for the entire table
        'color': '#eac086',  # Text color
        'border': '1px solid black',  # Border color
        'padding': '8px'  # Padding for cells
    })
    .set_caption('DataFrame Information: Attributes and Data Types')  # Add a title to the table
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#eac086')]},  # Heading background color
    ])
)

# Display the styled DataFrame
styled_df_info
Out[92]:
DataFrame Information: Attributes and Data Types
  Non-Null Count Data Type
Density_transformed 252 float64
BodyFat_transformed 252 float64
Age_transformed 252 float64
Weight_transformed 252 float64
Height_transformed 252 float64
Neck_transformed 252 float64
Chest_transformed 252 float64
Abdomen_transformed 252 float64
Hip_transformed 252 float64
Thigh_transformed 252 float64
Knee_transformed 252 float64
Ankle_transformed 252 float64
Biceps_transformed 252 float64
Forearm_transformed 252 float64
Wrist_transformed 252 float64

👉 | Correlation Matrix

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from scipy.stats import pearsonr, skew
from scipy.special import boxcox1p

# Step 1: Exploratory Data Analysis (EDA)
plt.figure(figsize=(14, 10))
sns.heatmap(df2.corr(), annot=True, cmap=sns.diverging_palette(230, 20, as_cmap=True), center=0, linewidths=0.5, square=True)
plt.title('Correlation Matrix of Features')
plt.show()

👉 | PairPlots

In [96]:
# Set the plotting style and palette
sns.set_palette(sns.color_palette(["#000000", "#eac086"]))

# Create the pair plot; sns.pairplot builds its own figure, so no plt.figure() call is needed
sns.pairplot(df2[['Density_transformed', 'BodyFat_transformed', 'Age_transformed',
                   'Weight_transformed', 'Height_transformed', 'Neck_transformed',
                   'Chest_transformed', 'Abdomen_transformed', 'Hip_transformed',
                   'Thigh_transformed', 'Knee_transformed', 'Ankle_transformed',
                   'Biceps_transformed', 'Forearm_transformed', 'Wrist_transformed']],
             diag_kind='kde', markers='o', height=3, aspect=1, kind='scatter')

# Set the title and show the plot
plt.suptitle('Pair Plot of Features')
plt.show()
In [111]:
correlations = df2.corr()['BodyFat_transformed'].sort_values(ascending=False)

# Filter features highly correlated with BodyFat
significant_features = correlations[abs(correlations) > 0.3].index.tolist()

# Create a DataFrame with the correlation values
corr_df = pd.DataFrame(correlations).reset_index()
corr_df.columns = ['Feature', 'Correlation']

# Function to apply color formatting based on correlation value
def color_corr(val):
    color = '#4CAF50' if val > 0.3 else ('#F44336' if val < -0.3 else '#ffffff')  # Green for positive, Red for negative, White for neutral
    return f'background-color: {color}; color: black'

# Apply the styling
styled_corr_df = corr_df.style.map(color_corr, subset=['Correlation']).set_table_styles(
    [{'selector': 'thead',
      'props': [('background-color', '#eac086'), ('color', 'black'), ('font-weight', 'bold')]}]
).set_properties(**{'background-color': 'black', 'color': 'white'})

# Display the styled DataFrame
styled_corr_df
Out[111]:
  Feature Correlation
0 BodyFat_transformed 1.000000
1 Abdomen_transformed 0.790485
2 Chest_transformed 0.678641
3 Hip_transformed 0.624756
4 Weight_transformed 0.618177
5 Thigh_transformed 0.576846
6 Knee_transformed 0.512810
7 Biceps_transformed 0.491864
8 Neck_transformed 0.464909
9 Forearm_transformed 0.357903
10 Wrist_transformed 0.345377
11 Ankle_transformed 0.295833
12 Age_transformed 0.281622
13 Height_transformed -0.000258
14 Density_transformed -0.945249

👉 | Separating Independent and Dependent Variables

In [9]:
X = df2.drop(columns=['BodyFat_transformed'], axis=1)
y = df2['BodyFat_transformed']

4 | Feature Engineering and Preprocessing¶

👉 | Adding Some Features

In [10]:
# Step 2: Feature Engineering
# Note: these are computed on the transformed columns, so they act as derived
# interaction features rather than literal BMI/BSA values on the raw scale.
X['BMI'] = X['Weight_transformed'] / (X['Height_transformed'] / 100) ** 2  # Body Mass Index
X['WaistToHipRatio'] = X['Abdomen_transformed'] / X['Hip_transformed']  # Waist-to-Hip Ratio
X['BodySurfaceArea'] = 0.007184 * (X['Height_transformed'] ** 0.725) * (X['Weight_transformed'] ** 0.425)  # Body Surface Area
X['AgeSquared'] = X['Age_transformed'] ** 2  # Age squared to capture non-linear effects
X['AbdomenToChestRatio'] = X['Abdomen_transformed'] / X['Chest_transformed']  # Abdomen-to-Chest Ratio

# Step 7: Domain-Specific Insights and Custom Feature Extraction
X['UpperBodyFat'] = X['Neck_transformed'] + X['Chest_transformed'] + X['Biceps_transformed']
X['LowerBodyFat'] = X['Thigh_transformed'] + X['Knee_transformed'] + X['Ankle_transformed']
X['ArmFatIndex'] = (X['Biceps_transformed'] + X['Forearm_transformed']) / X['Wrist_transformed']

👉 | Dropping Features

In [11]:
X.drop(['Weight_transformed', 'Neck_transformed', 'Biceps_transformed', 'Knee_transformed', 'Ankle_transformed', 'Forearm_transformed', 'Wrist_transformed' ,'Height_transformed','Abdomen_transformed','Chest_transformed','Hip_transformed','Thigh_transformed'],axis=1,inplace=True)
In [11]:
X.columns
Out[11]:
Index(['Density_transformed', 'Age_transformed', 'BMI', 'WaistToHipRatio',
       'BodySurfaceArea', 'AgeSquared', 'AbdomenToChestRatio', 'UpperBodyFat',
       'LowerBodyFat', 'ArmFatIndex'],
      dtype='object')
In [66]:
X.head(2)
Out[66]:
Density_transformed Age_transformed BMI WaistToHipRatio BodySurfaceArea AgeSquared AbdomenToChestRatio UpperBodyFat LowerBodyFat ArmFatIndex
0 0.071748 7.566358 6.203097e-16 2.751084 102378.568243 57.249780 1.572441 6.002123 1.889697 202.296087
1 0.086672 7.356642 2.983345e-16 2.749984 133952.229459 54.120175 1.571798 5.943992 1.889657 221.398142

👉 | Feature Extraction By Correlation

In [112]:
correlations = df2.corr()['BodyFat_transformed'].sort_values(ascending=False)

# Filter features highly correlated with BodyFat
significant_features = correlations[abs(correlations) > 0.3].index.tolist()

# Create a DataFrame with the correlation values
corr_df = pd.DataFrame(correlations).reset_index()
corr_df.columns = ['Feature', 'Correlation']

# Function to apply color formatting based on correlation value
def color_corr(val):
    color = '#4CAF50' if val > 0.3 else ('#F44336' if val < -0.3 else '#ffffff')  # Green for positive, Red for negative, White for neutral
    return f'background-color: {color}; color: black'

# Apply the styling
styled_corr_df = corr_df.style.map(color_corr, subset=['Correlation']).set_table_styles(
    [{'selector': 'thead',
      'props': [('background-color', '#eac086'), ('color', 'black'), ('font-weight', 'bold')]}]
).set_properties(**{'background-color': 'black', 'color': 'white'})

# Display the styled DataFrame
styled_corr_df
Out[112]:
  Feature Correlation
0 BodyFat_transformed 1.000000
1 Abdomen_transformed 0.790485
2 Chest_transformed 0.678641
3 Hip_transformed 0.624756
4 Weight_transformed 0.618177
5 Thigh_transformed 0.576846
6 Knee_transformed 0.512810
7 Biceps_transformed 0.491864
8 Neck_transformed 0.464909
9 Forearm_transformed 0.357903
10 Wrist_transformed 0.345377
11 Ankle_transformed 0.295833
12 Age_transformed 0.281622
13 Height_transformed -0.000258
14 Density_transformed -0.945249

👉 | Feature Extraction By Recursive Feature Elimination (RFE)

In [94]:
# Provided feature importances
rfe_features = [
    'Density_transformed',
    'Age_transformed',
    'WaistToHipRatio',
    'BodySurfaceArea',
    'AgeSquared',
    'AbdomenToChestRatio',
    'UpperBodyFat',
    'LowerBodyFat',
    'ArmFatIndex'
]
rfe_importances = [
    0.9481,
    0.0004,
    0.0098,
    0.0081,
    0.0014,
    0.0113,
    0.0056,
    0.0111,
    0.0041
]

# Create a DataFrame
df_rfe = pd.DataFrame({
    'Selected Feature': rfe_features,
    'Importance': rfe_importances
})

# Define styling function
def style_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#eac086'),
                    ('color', 'black'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('font-size', '14px')]},
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#000000'),
                    ('color', '#eac086'),
                    ('text-align', 'center'),
                    ('font-size', '12px')]},
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '60%'),
                    ('margin', '20px auto'),
                    ('border', '2px solid #000000')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]}]
    ).set_properties(**{'text-align': 'center'}).hide(axis='index')

# Apply styling to the DataFrame
styled_df_rfe = style_df(df_rfe)

# Display the styled DataFrame
styled_df_rfe
Out[94]:
Selected Feature Importance
Density_transformed 0.948100
Age_transformed 0.000400
WaistToHipRatio 0.009800
BodySurfaceArea 0.008100
AgeSquared 0.001400
AbdomenToChestRatio 0.011300
UpperBodyFat 0.005600
LowerBodyFat 0.011100
ArmFatIndex 0.004100
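The importances above were pasted in as literals rather than computed in this cell. An actual RFE run would look roughly like this, shown here on synthetic data and assuming a RandomForestRegressor estimator as used elsewhere in the notebook:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the notebook's X / y
X_demo, y_demo = make_regression(n_samples=200, n_features=9,
                                 n_informative=3, random_state=42)

# Keep the 5 strongest features, dropping one per elimination round
rfe = RFE(estimator=RandomForestRegressor(n_estimators=50, random_state=42),
          n_features_to_select=5, step=1)
rfe.fit(X_demo, y_demo)

print(rfe.support_.sum())  # number of selected features → 5
```

`rfe.support_` is a boolean mask over the columns and `rfe.ranking_` assigns 1 to every kept feature, which is how a selected-feature table like the one above would be built.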

Here we have created several new columns, namely -¶

  • BMI - Body Mass Index
  • WaistToHipRatio - Waist-to-Hip (Abdomen/Hip) Ratio
  • AbdomenToChestRatio - Abdomen-to-Chest Ratio
  • BodySurfaceArea, AgeSquared, UpperBodyFat, LowerBodyFat, ArmFatIndex - additional composite features

Replacing groups of strongly correlated raw measurements with these ratios and composites will help us reduce some of the problems caused by multicollinearity.
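Multicollinearity can be quantified with the variance inflation factor, VIF = 1/(1 − R²), where R² comes from regressing one feature on all the others; values well above roughly 5-10 usually flag a problematic feature. A sketch on synthetic data (not part of the original notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X: np.ndarray, col: int) -> float:
    """VIF of column `col`: 1 / (1 - R^2) from regressing it on the remaining columns."""
    others = np.delete(X, col, axis=1)
    r2 = LinearRegression().fit(others, X[:, col]).score(others, X[:, col])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
# Column 2 nearly duplicates column 0, mimicking e.g. Weight vs Hip
X_demo = np.column_stack([a, b, a + 0.05 * rng.normal(size=300)])

print(round(vif(X_demo, 1), 1), round(vif(X_demo, 2), 1))
```

The independent column gets a VIF near 1, while the near-duplicate column's VIF explodes, which is exactly the situation the ratio features are meant to avoid.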

👉 | Removing Outliers

In [95]:
from sklearn.ensemble import IsolationForest

# Assuming X and y are your feature matrix and target vector
# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.05, random_state=42)  # Adjust contamination as needed

# Fit the model and predict outliers
outliers = iso_forest.fit_predict(X) == -1

# Filter out outliers from the dataset
X_clean = X[~outliers]
y_clean = y[~outliers]

# Output the number of rows before and after cleaning
original_rows = X.shape[0]
cleaned_rows = X_clean.shape[0]
rows_removed = original_rows - cleaned_rows

# Prepare data for styling
summary_data = {
    'Metric': ['Original Number of Rows', 'Number of Rows After Removing Outliers', 'Number of Rows Removed'],
    'Value': [original_rows, cleaned_rows, rows_removed]
}

# Convert to DataFrame
summary_df = pd.DataFrame(summary_data)

# Define a function to apply styling
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: #E6E6FA' if v else '' for v in is_max]

# Apply styling
styled_summary_df = summary_df.style.apply(highlight_max, axis=0).set_table_styles(
    [{'selector': 'thead th',
      'props': [('background-color', '#eac086'),
                ('color', 'black'),
                ('font-weight', 'bold')]},
     {'selector': 'td',
      'props': [('padding', '10px'),
                ('background-color', '#F5F5F5')]},
     {'selector': 'table',
      'props': [('border-collapse', 'collapse'),
                ('width', '50%')]},
     {'selector': 'tr:nth-of-type(even)',
      'props': [('background-color', '#FAFAFA')]},
     {'selector': 'tr:nth-of-type(odd)',
      'props': [('background-color', '#FFFFFF')]}]
)

# Display the styled DataFrame
styled_summary_df
Out[95]:
  Metric Value
0 Original Number of Rows 252
1 Number of Rows After Removing Outliers 239
2 Number of Rows Removed 13

👉 | Splitting into train and test set

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Use the cleaned dataset after outlier removal
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)

# Prepare data for styling
summary_data = {
    'Dataset': ['Training Features', 'Test Features', 'Training Labels', 'Test Labels'],
    'Shape': [X_train.shape, X_test.shape, y_train.shape, y_test.shape]
}

# Convert to DataFrame
summary_df = pd.DataFrame(summary_data)

# Define a function to apply styling
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: #eac086' if v else '' for v in is_max]

# Apply styling
styled_summary_df = summary_df.style.apply(highlight_max, axis=0).set_table_styles(
    [{'selector': 'thead th',
      'props': [('background-color', '#eac086'),
                ('color', 'black'),
                ('font-weight', 'bold')]},
     {'selector': 'td',
      'props': [('padding', '10px'),
                ('background-color', '#000000'),
                ('color', 'white')]},
     {'selector': 'table',
      'props': [('border-collapse', 'collapse'),
                ('width', '50%')]},
     {'selector': 'tr:nth-of-type(even)',
      'props': [('background-color', '#f9f9f9')]},
     {'selector': 'tr:nth-of-type(odd)',
      'props': [('background-color', '#ffffff')]}]
)

# Display the styled DataFrame
styled_summary_df
Out[14]:
  Dataset Shape
0 Training Features (191, 10)
1 Test Features (48, 10)
2 Training Labels (191,)
3 Test Labels (48,)
In [82]:
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)
y_train_df = pd.DataFrame(y_train)
y_test_df = pd.DataFrame(y_test)

X_train_df.to_csv(r'C:\Users\HP\Downloads\X_train.csv', index=False)
X_test_df.to_csv(r'C:\Users\HP\Downloads\X_test.csv', index=False)
y_train_df.to_csv(r'C:\Users\HP\Downloads\y_train.csv', index=False)
y_test_df.to_csv(r'C:\Users\HP\Downloads\y_test.csv', index=False)

👉 | Applying Feature Scaling

In [15]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the training data and transform both the training and testing data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

5 | Model Building¶

👉 | Metric Used: R2_score
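R² measures the fraction of target variance explained by the predictions: R² = 1 − SS_res/SS_tot, so 1.0 is a perfect fit and 0.0 means the model does no better than predicting the mean. A minimal illustration with sklearn's r2_score (not one of the notebook's own cells):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# 1 - SS_res/SS_tot = 1 - 1.5/29.1875
print(round(r2_score(y_true, y_pred), 3))  # → 0.949
```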

In [63]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the chosen models
chosen_models = {
    'XGBRegressor': XGBRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'GradientBoostingRegressor': GradientBoostingRegressor()
}

# Create a DataFrame to display model details
model_names = list(chosen_models.keys())
model_instances = [model.__class__.__name__ for model in chosen_models.values()]
model_data = {
    'Model Name': model_names,
    'Model Instance': model_instances
}

model_df = pd.DataFrame(model_data)

def style_model_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#000000'),
                    ('color', '#eac086'),
                    ('font-weight', 'bold')]},
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#eac086'),
                    ('color', 'black')]},
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '60%')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]},
         {'selector': 'th:first-child, td:first-child',
          'props': [('border-right', '3px solid #000000')]},  
         {'selector': 'tr',
          'props': [('border-bottom', '3px solid #000000')]}  # Add a horizontal line after each row
         ]
    ).set_properties(**{'text-align': 'left'}).hide(axis='index')
    
# Apply styling to the DataFrame
styled_model_df = style_model_df(model_df)

# Display the styled DataFrame
styled_model_df
Out[63]:
Model Name Model Instance
XGBRegressor XGBRegressor
RandomForestRegressor RandomForestRegressor
GradientBoostingRegressor GradientBoostingRegressor
In [98]:
# Parameter grid for XGBRegressor
param_grid_xgb = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Parameter grid for RandomForestRegressor
param_grid_rf = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Parameter grid for GradientBoostingRegressor
param_grid_gb = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}

# Create model instances (named so they don't shadow the `import xgboost as xgb` above)
xgb_reg = XGBRegressor()
rf_reg = RandomForestRegressor()
gb_reg = GradientBoostingRegressor()

# Perform GridSearchCV for each model
grid_xgb = GridSearchCV(estimator=xgb_reg, param_grid=param_grid_xgb, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_rf = GridSearchCV(estimator=rf_reg, param_grid=param_grid_rf, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
grid_gb = GridSearchCV(estimator=gb_reg, param_grid=param_grid_gb, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
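These grids are large: the XGBoost grid alone has 4·4·4·3·3·3 = 1,728 combinations, i.e. 8,640 fits at cv=5. When an exhaustive search becomes too slow, RandomizedSearchCV samples a fixed number of candidates from the same space. A sketch on synthetic data (an alternative approach, not what the notebook ran):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=10, random_state=42)

# Distributions instead of fixed lists; randint(a, b) samples [a, b)
param_dist = {
    'n_estimators': randint(50, 201),
    'learning_rate': uniform(0.001, 0.2),
    'max_depth': randint(3, 7),
}

search = RandomizedSearchCV(GradientBoostingRegressor(random_state=42), param_dist,
                            n_iter=10, cv=3, random_state=42,
                            scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

print(len(search.cv_results_['params']))  # exactly n_iter candidates tried
```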
In [99]:
grid_xgb.fit(X_train, y_train)
Out[99]:
GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None, m...
                                    monotone_constraints=None,
                                    multi_strategy=None, n_estimators=None,
                                    n_jobs=None, num_parallel_tree=None,
                                    random_state=None, ...),
             n_jobs=-1,
             param_grid={'colsample_bytree': [0.6, 0.8, 1.0],
                         'learning_rate': [0.001, 0.01, 0.1, 0.2],
                         'max_depth': [3, 4, 5, 6],
                         'min_child_weight': [1, 3, 5],
                         'n_estimators': [50, 100, 150, 200],
                         'subsample': [0.6, 0.8, 1.0]},
             scoring='neg_mean_squared_error')
In [100]:
# Fit GridSearchCV
grid_rf.fit(X_train, y_train)
Out[100]:
GridSearchCV(cv=5, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'max_depth': [None, 10, 20, 30, 40],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [50, 100, 150, 200]},
             scoring='neg_mean_squared_error')
In [101]:
# Fit GridSearchCV
grid_gb.fit(X_train, y_train)
Out[101]:
GridSearchCV(cv=5, estimator=GradientBoostingRegressor(), n_jobs=-1,
             param_grid={'learning_rate': [0.001, 0.01, 0.1, 0.2],
                         'max_depth': [3, 4, 5, 6],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [50, 100, 150, 200],
                         'subsample': [0.8, 0.9, 1.0]},
             scoring='neg_mean_squared_error')
In [102]:
results = {
    'Model': ['XGBRegressor', 'RandomForestRegressor', 'GradientBoostingRegressor'],
    'Best Parameters': [grid_xgb.best_params_, grid_rf.best_params_, grid_gb.best_params_],
    'Best Score': [grid_xgb.best_score_, grid_rf.best_score_, grid_gb.best_score_]
}

results_df = pd.DataFrame(results)

# Define a function to style the DataFrame with more details
def style_results_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#eac086'),
                    ('color', 'black'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center')]},
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#000000'),
                    ('color', 'white'),
                    ('text-align', 'center')]},
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%'),
                    ('margin', '20px auto')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]}]
    ).set_properties(**{'text-align': 'center'}).hide(axis='index')

# Apply styling to the results DataFrame
styled_results_df = style_results_df(results_df)

# Display the styled DataFrame
styled_results_df
Out[102]:
Model Best Parameters Best Score
XGBRegressor {'colsample_bytree': 0.6, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0} -0.011358
RandomForestRegressor {'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50} -0.010256
GradientBoostingRegressor {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.8} -0.010351
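The Best Score column is the negated mean squared error (that is what `scoring='neg_mean_squared_error'` reports), so values closer to 0 are better. A small sketch converting them to RMSE for readability, with the scores copied from the table above:

```python
import numpy as np

# Best cross-validated scores from the table above (negated MSE)
best_scores = {'XGBRegressor': -0.011358,
               'RandomForestRegressor': -0.010256,
               'GradientBoostingRegressor': -0.010351}

# Negate, then take the square root to recover RMSE on the target scale
rmse = {name: float(np.sqrt(-score)) for name, score in best_scores.items()}
print(rmse)
```

On this criterion the random forest actually has the best cross-validated score, which is why the held-out test comparison below is the deciding evidence.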
In [105]:
# Print the best parameters of each model
print("Best parameters for XGBRegressor:")
print(grid_xgb.best_params_)
print("\nBest parameters for RandomForestRegressor:")
print(grid_rf.best_params_)
print("\nBest parameters for GradientBoostingRegressor:")
print(grid_gb.best_params_)

# Define the models with the best hyperparameters
xgb_params = grid_xgb.best_params_
rf_params = grid_rf.best_params_
gb_params = grid_gb.best_params_
Best parameters for XGBRegressor:
{'colsample_bytree': 0.6, 'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}

Best parameters for RandomForestRegressor:
{'bootstrap': True, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}

Best parameters for GradientBoostingRegressor:
{'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 200, 'subsample': 0.8}
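`best_params_` only reports the single winner; the full `cv_results_` dictionary records every configuration tried, which is useful for checking how close the runners-up were. A self-contained sketch on a tiny stand-in search (synthetic data, not the body-fat features):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Tiny stand-in search so the snippet runs on its own
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      {'n_estimators': [25, 50], 'max_depth': [2, 3]},
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X, y)

# One row per parameter combination; rank_test_score == 1 marks the winner
cv_df = pd.DataFrame(search.cv_results_)
top = cv_df.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(top.head())
```

In the notebook, the same pattern applied to `grid_gb.cv_results_` (or `grid_rf`, `grid_xgb`) shows whether the chosen parameters win by a meaningful margin or by noise.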
In [106]:
# Define the models with the best hyperparameters
xgb_model = XGBRegressor(**xgb_params)
rf_model = RandomForestRegressor(**rf_params)
gb_model = GradientBoostingRegressor(**gb_params)

# Fit the models
xgb_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)
In [107]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error, median_absolute_error, explained_variance_score
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def compute(model, X_train, y_train, X_test, y_test, hashmap):
    """
    Train the model, make predictions, and compute evaluation metrics.
    
    Parameters:
    - model: The machine learning model to be evaluated.
    - X_train: Training features.
    - y_train: Training labels.
    - X_test: Testing features.
    - y_test: Testing labels.
    - hashmap: Dictionary to store the model name and metrics.
    
    Returns:
    - None
    """
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Compute metrics
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    medae = median_absolute_error(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)
    
    # Update the hashmap with metrics, keyed by the model's class name
    model_name = type(model).__name__
    hashmap[model_name] = {
        'R^2': r2,
        'RMSE': rmse,
        'MAE': mae,
        'MAPE': mape,
        'Median AE': medae,
        'Explained Variance Score': evs
    }

# Example usage with models
hashmap = {}

# Default-hyperparameter instances; swap in the tuned models (xgb_model, rf_model, gb_model) to score the best configurations
models = {
    'XGBRegressor': XGBRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'GradientBoostingRegressor': GradientBoostingRegressor()
}

# Evaluate each model
for name, model in models.items():
    compute(model, X_train, y_train, X_test, y_test, hashmap)

# Convert hashmap to DataFrame for better presentation
results_df = pd.DataFrame.from_dict(hashmap, orient='index').reset_index()
results_df.rename(columns={'index': 'Model'}, inplace=True)

# Style the results DataFrame
def style_results_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#eac086'),
                    ('color', 'black'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('font-size', '14px')]},
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#000000'),
                    ('color', '#eac086'),
                    ('text-align', 'center'),
                    ('font-size', '12px')]},
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%'),
                    ('margin', '20px auto'),
                    ('border', '2px solid #000000')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]}]
    ).set_properties(**{'text-align': 'center'}).hide(axis='index')

# Apply styling to the results DataFrame
styled_results_df = style_results_df(results_df)

# Display the styled DataFrame
styled_results_df
Out[107]:
Model R^2 RMSE MAE MAPE Median AE Explained Variance Score
XGBRegressor 0.989803 0.047240 0.022924 0.008950 0.009602 0.989864
RandomForestRegressor 0.980255 0.065738 0.026357 0.010270 0.007266 0.980640
GradientBoostingRegressor 0.987876 0.051511 0.022400 0.008947 0.004594 0.987943
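Note that the actual/predicted values below sit between roughly 1.7 and 3.6, which suggests the BodyFat target was transformed earlier in the notebook (plausibly with `np.log`; that step is not shown here, so treat it as an assumption). If so, the RMSE/MAE above are on the transformed scale; to report errors in percentage points, back-transform before scoring. A sketch with made-up stand-in values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical log-scale targets and predictions (stand-ins for y_test / y_pred)
y_test_log = np.array([2.51, 1.81, 3.23, 2.34])   # e.g. np.log(body_fat_pct)
y_pred_log = np.array([2.48, 1.90, 3.20, 2.40])

# RMSE on the log scale (what the table above reports)
rmse_log = np.sqrt(mean_squared_error(y_test_log, y_pred_log))

# Back-transform before scoring to get RMSE in percentage points,
# assuming the original transform was np.log
rmse_pct = np.sqrt(mean_squared_error(np.exp(y_test_log), np.exp(y_pred_log)))
print(f"log-scale RMSE: {rmse_log:.4f}, original-scale RMSE: {rmse_pct:.4f} pct points")
```

Errors that look tiny on the log scale can correspond to a point or more of body fat once exponentiated, so the back-transformed figure is the one a reader should quote.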
In [108]:
import pandas as pd

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Actual Body_Fat': y_test,
    'XGB Predicted': y_pred_xgb,
    'RandomForest Predicted': y_pred_rf,
    'GradientBoosting Predicted': y_pred_gb
})

# Style the DataFrame
def style_comparison_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#000000'),
                    ('color', 'white'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('font-size', '18px')]},  # Increased font size to 18px
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#eac086 '),
                    ('color', 'black'),
                    ('text-align', 'center'),
                    ('font-size', '14px')]},  # Increased font size to 14px
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%'),
                    ('margin', '20px auto'),
                    ('border', '2px solid #000000')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]},
         {'selector': 'th, td',
          'props': [('border-right', '1px solid #000000')]},  # Added vertical lines
         {'selector': 'th:first-child, td:first-child',
          'props': [('border-left', '1px solid #000000')]},  # Added vertical lines
         {'selector': 'th:last-child, td:last-child',
          'props': [('border-right', '1px solid #000000')]},  # Added vertical lines
    ]).set_properties(**{'text-align': 'center'}).hide(axis='index')

# Apply styling to the results DataFrame
styled_comparison_df = style_comparison_df(comparison_df)

# Display the styled DataFrame
styled_comparison_df
Out[108]:
Actual Body_Fat XGB Predicted RandomForest Predicted GradientBoosting Predicted
1.740466 1.923942 1.801761 1.773055
2.667228 2.683122 2.646584 2.632619
3.186353 3.209780 3.184591 3.183087
2.928524 2.930230 2.920177 2.925560
3.077312 3.063828 3.067740 3.072026
2.602690 2.617278 2.603371 2.619777
3.374169 3.340529 3.377351 3.363605
2.351375 2.358820 2.367517 2.376295
2.208274 2.186350 2.187794 2.162981
3.303217 3.326757 3.305812 3.305484
2.151762 2.188141 2.107588 2.105292
2.251292 2.298088 2.256393 2.362219
3.397858 3.383094 3.403739 3.395556
2.714695 2.639407 2.710762 2.697947
3.000720 2.982914 2.999048 3.003940
3.610918 3.610105 3.600507 3.615493
3.314186 3.288378 3.308686 3.306841
3.433987 3.423511 3.430143 3.429656
2.856470 2.871415 2.843550 2.841665
3.206803 3.216308 3.202033 3.200899
3.335770 3.293461 3.329212 3.330851
3.049273 3.052361 3.051418 3.055137
3.471966 3.486433 3.470359 3.471233
2.624669 2.545546 2.529972 2.475943
2.360854 2.405570 2.380210 2.303955
2.533697 2.625536 2.553749 2.558353
2.282382 2.421396 2.418296 2.476196
2.980619 3.003736 2.982452 2.984478
3.443618 3.462661 3.460553 3.486458
2.433613 2.356863 2.429167 2.291836
3.095578 3.100104 3.099773 3.095440
2.674149 2.597281 2.577352 2.466882
2.292535 2.288321 2.245968 2.240487
2.884801 2.924354 2.896750 2.889850
2.917771 2.953461 2.921842 2.918005
3.020425 2.666239 2.709060 2.726274
1.974081 1.984332 2.035196 2.065408
3.459466 3.421786 3.461100 3.457072
3.353407 3.358920 3.357636 3.351748
3.549617 3.515830 3.543443 3.511788
2.667228 2.660399 2.653350 2.613112
3.325036 3.315657 3.326879 3.322874
2.484907 2.601023 2.499396 2.506492
3.214868 3.208258 3.214617 3.215482
2.839078 2.765460 2.784825 2.823285
2.351375 2.357627 2.330183 2.408689
2.939162 2.923510 2.941383 2.935352
3.526361 3.483199 3.524351 3.512623
In [110]:
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(comparison_df):
    """
    Plots actual vs predicted values for all models with detailed styling.
    Uses black background and '#eac086' color for plotting.

    Parameters:
    - comparison_df: DataFrame with actual vs predicted values.
    """
    # Limit to the first 10 values for plotting
    comparison_df = comparison_df.head(10)
    
    plt.figure(figsize=(18, 6))
    plt.style.use('dark_background')  # Use a dark background for the plots

    # Colors for each model
    colors = ['#eac086', '#eac086', '#eac086']  # Same color for all models as requested

    # Iterate through each model prediction to create subplots
    for i, model in enumerate(['XGB Predicted', 'RandomForest Predicted', 'GradientBoosting Predicted']):
        plt.subplot(1, 3, i + 1)
        plt.scatter(comparison_df['Actual Body_Fat'], comparison_df[model], 
                    color=colors[i], alpha=0.7, edgecolor='black', label=f'{model} Predictions', s=100)  # Increase scatter size
        plt.plot([comparison_df['Actual Body_Fat'].min(), comparison_df['Actual Body_Fat'].max()],
                 [comparison_df['Actual Body_Fat'].min(), comparison_df['Actual Body_Fat'].max()], 
                 '--', lw=3, color='white', label='Perfect Fit')  # Increase line width and set color
        plt.xlabel('Actual Body Fat Percentage', fontsize=12)
        plt.ylabel('Predicted Body Fat Percentage', fontsize=12)
        plt.title(f'{model}: Actual vs Predicted', fontsize=14, color='white')
        plt.legend()

        # Add value labels for each point
        for j in range(len(comparison_df)):
            plt.text(comparison_df['Actual Body_Fat'].iloc[j], comparison_df[model].iloc[j], 
                     f'{comparison_df[model].iloc[j]:.2f}', fontsize=10, color='white')

    plt.tight_layout()
    plt.show()

# Plot the comparison with the updated plotting function
plot_actual_vs_predicted(comparison_df)
[Figure: side-by-side scatter plots of actual vs. predicted body fat (first 10 test samples) for XGB, RandomForest, and GradientBoosting, each with the dashed perfect-fit line]
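A complementary diagnostic is a residual plot: actual minus predicted against the prediction, where a flat band around zero indicates no systematic bias. A minimal sketch (`plot_residuals` is a helper introduced here; the values passed in are placeholders, not notebook data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

def plot_residuals(y_true, y_pred, label):
    """Scatter residuals (actual - predicted) against predictions."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    plt.scatter(y_pred, residuals, color='#eac086', edgecolor='black')
    plt.axhline(0, ls='--', color='white', lw=2)  # zero-error reference line
    plt.xlabel('Predicted')
    plt.ylabel('Residual')
    plt.title(f'{label}: residuals')
    return residuals

# Stand-in values; in the notebook: plot_residuals(y_test, y_pred_xgb, 'XGB')
res = plot_residuals([2.5, 3.0, 3.4], [2.4, 3.1, 3.4], 'demo')
print(res)
```

Patterns such as a funnel shape (heteroscedasticity) or a tilt (under/over-prediction at the extremes) show up here more clearly than in an actual-vs-predicted scatter.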
In [90]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Data for plotting
results_df = pd.DataFrame({
    'Model': ['XGBRegressor', 'RandomForestRegressor', 'GradientBoostingRegressor'],
    'R^2': [0.989803, 0.980559, 0.983830],
    'RMSE': [0.047240, 0.065228, 0.059489],
    'MAE': [0.022924, 0.026300, 0.025341],
    'MAPE': [0.008950, 0.010465, 0.010254],
    'Median AE': [0.009602, 0.006151, 0.005138],
    'Explained Variance Score': [0.989864, 0.980954, 0.983907]
})

# Set the style
plt.style.use('dark_background')
sns.set_palette(sns.color_palette(['#eac086', '#ffcd94', '#ffad60']))

# Define metrics
metrics = ['R^2', 'RMSE', 'MAE', 'MAPE', 'Median AE', 'Explained Variance Score']

# Create and save plots for each metric
for metric in metrics:
    plt.figure(figsize=(12, 8))
    
    ax = sns.barplot(x='Model', y=metric, data=results_df)
    
    # Add data labels with white text
    for container in ax.containers:
        ax.bar_label(container, fmt='%.4f', fontsize=12, color='white')
    
    # Set labels, title, and subtitle
    plt.title(f'{metric} Comparison', fontsize=18, color='white')
    plt.suptitle(f'Comparison of Models based on {metric}', fontsize=14, color='white', y=0.94)
    plt.xlabel('Model', fontsize=14, color='white')
    plt.ylabel(metric, fontsize=14, color='white')
    
    # Set background color
    plt.gca().set_facecolor('#000000')
    
    # Save the figure
    plt.savefig(f'{metric}_comparison.png', bbox_inches='tight')
    
    # Show the plot
    plt.show()
[Figures: six bar charts comparing the three models on R², RMSE, MAE, MAPE, Median AE, and Explained Variance Score]

🏆 Best ML Model: XGBRegressor 🏆¶

After a detailed comparison of the three models—XGBRegressor, RandomForestRegressor, and GradientBoostingRegressor—the XGBRegressor is declared the winner based on the following reasoning:

1. Highest R² Score¶

  • XGBRegressor has the highest R² Score of 0.9898, indicating that it explains approximately 98.98% of the variance in the data. This is marginally higher than the GradientBoostingRegressor (0.9879) and significantly better than the RandomForestRegressor (0.9803), demonstrating superior predictive power.

2. Lowest RMSE (Root Mean Squared Error)¶

  • The RMSE for XGBRegressor is 0.0472, which is the lowest among the three models. This reflects that its predictions are closer to the actual values, indicating a high degree of accuracy. In contrast, GradientBoostingRegressor has a slightly higher RMSE of 0.0515, and RandomForestRegressor has a much higher RMSE of 0.0657.

3. Competitive MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error)¶

  • The XGBRegressor achieves a MAE of 0.0229 and a MAPE of 0.00895, which are very competitive. Although the GradientBoostingRegressor edges it out slightly with a lower MAE of 0.0224 and a nearly identical MAPE of 0.00895, the difference is minimal. The RandomForestRegressor has higher errors (MAE of 0.0264 and MAPE of 0.0103), indicating less accurate predictions.

4. Robustness in Median Absolute Error (Median AE)¶

  • The XGBRegressor's Median AE of 0.0096 is higher than the GradientBoostingRegressor's (0.0046) and the RandomForestRegressor's (0.0073), making this the one metric on which it trails. Half of its absolute errors still fall below 0.01, however, and this single weakness does not outweigh its advantage on the other metrics.

5. Highest Explained Variance Score¶

  • The Explained Variance Score for XGBRegressor is 0.9899, the highest among all three models. This score further validates that the XGBRegressor captures the underlying variance in the data most effectively, leading to more reliable predictions.

Conclusion:¶

While all three models perform well, the XGBRegressor stands out due to its combination of the highest R² score, lowest RMSE, competitive MAE and MAPE, and the highest Explained Variance Score (0.9899). Together these make it the most robust and reliable choice for predicting body fat on this dataset.

In [91]:
import pickle
import xgboost as xgb

# Assume xgb_model is already defined and trained
xgb_model = xgb.XGBRegressor(**xgb_params)

# Train the model
xgb_model.fit(X_train, y_train)

# Save the model to a file
with open('xgb_model.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)

print("XGBRegressor model saved successfully!")
XGBRegressor model saved successfully!
In [102]:
def predict_body_fat(user_input):
    # Define the expected features
    expected_features = [
        'Density_transformed', 'Age_transformed', 'BMI', 'WaistToHipRatio',
        'BodySurfaceArea', 'AgeSquared', 'AbdomenToChestRatio', 'UpperBodyFat',
        'LowerBodyFat', 'ArmFatIndex'
    ]
    
    # Create a DataFrame with the user input
    input_df = pd.DataFrame([user_input])
    
    # Ensure the DataFrame has all expected features
    input_df = input_df.reindex(columns=expected_features, fill_value=0)
    
    # Make prediction (note: the model outputs on the transformed target scale
    # used during training, not raw body-fat percent)
    prediction = loaded_xgb_model.predict(input_df)[0]
    
    return f"Predicted Body Fat Percentage: {prediction:.2f}"
In [104]:
# Load the model from the file
import pickle

with open('xgb_model.pkl', 'rb') as file:
    loaded_xgb_model = pickle.load(file)

print("XGBRegressor model loaded successfully!")

def get_user_input():
    user_input = {}
    required_features = [
        'Density_transformed', 'Age_transformed', 'BMI', 'WaistToHipRatio',
        'BodySurfaceArea', 'AgeSquared', 'AbdomenToChestRatio', 'UpperBodyFat',
        'LowerBodyFat', 'ArmFatIndex'
    ]
    
    print("Please provide the following information:")
    
    for feature in required_features:
        while True:
            try:
                value = input(f"Enter value for {feature} (numeric, e.g., 1.23): ")
                
                # Additional validation (e.g., check for realistic ranges, positive values)
                value = float(value)
                if value < 0:
                    print(f"Value for {feature} cannot be negative. Please enter a positive number.")
                    continue
                
                user_input[feature] = value
                break
            except ValueError:
                print(f"Invalid input. Please enter a valid numerical value for {feature}.")
    
    return user_input


# Collect user input
user_input = get_user_input()

if user_input:
    # Predict and display the result
    result = predict_body_fat(user_input)
    print(result)
else:
    print("User input was not valid. Please try again.")
XGBRegressor model loaded successfully!
Please provide the following information:
Predicted Body Fat Percentage: 2.64
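Because `input()` calls make the cell non-reproducible, the same pipeline can be exercised with a hard-coded dictionary. A sketch of the input-assembly step (`build_input_frame` is a helper introduced here, and the sample values are made up, not taken from the dataset):

```python
import pandas as pd

EXPECTED_FEATURES = [
    'Density_transformed', 'Age_transformed', 'BMI', 'WaistToHipRatio',
    'BodySurfaceArea', 'AgeSquared', 'AbdomenToChestRatio', 'UpperBodyFat',
    'LowerBodyFat', 'ArmFatIndex'
]

def build_input_frame(user_input: dict) -> pd.DataFrame:
    """Single-row frame in the column order the model was trained on;
    missing features default to 0, matching predict_body_fat above."""
    return pd.DataFrame([user_input]).reindex(columns=EXPECTED_FEATURES, fill_value=0)

# Made-up example values
sample = {'BMI': 24.5, 'WaistToHipRatio': 0.9, 'Age_transformed': 3.2}
frame = build_input_frame(sample)
print(frame.shape)  # (1, 10)
```

Filling absent features with 0 keeps the column layout valid, but note it is only sensible if 0 is a plausible placeholder for those features; in practice all ten values should be supplied.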

Deep Learning¶

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Dropout
from tensorflow.keras.regularizers import l1_l2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
In [3]:
# Define Simple Feedforward Neural Network (FNN)
def create_fnn_model(input_dim):
    model = Sequential()
    model.add(Dense(16, input_dim=input_dim, activation='relu', kernel_regularizer=l1_l2(l1=0.01, l2=0.01)))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1))  # Output layer for regression
    model.compile(optimizer=Adam(), loss='mean_squared_error')
    return model

# Define Deep Neural Network (DNN) with Dropout
def create_dnn_model_with_dropout(input_dim):
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.3))  # Dropout layer with 30% dropout rate
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1))  # Output layer for regression
    model.compile(optimizer=Adam(), loss='mean_squared_error')
    return model
In [4]:
import pandas as pd

# Summary DataFrame for deep learning models
model_summary = {
    'Model Name': ['Feedforward Neural Network (FNN)', 'Deep Neural Network with Dropout'],
    'Architecture': ['2 hidden layers (16, 8 units)', '2 hidden layers (64, 32 units) with Dropout'],
    'Activation': ['ReLU', 'ReLU'],
    'Dropout': ['None', 'Dropout (0.3)'],
    'Optimizer': ['Adam', 'Adam'],
    'Loss Function': ['Mean Squared Error', 'Mean Squared Error'],
    'Early Stopping': ['Patience: 10, Min Delta: 0.0005', 'Patience: 10, Min Delta: 0.0005'],
    'Epochs': ['150', '150'],
    'Batch Size': ['8', '8']
}

summary_df = pd.DataFrame(model_summary)

# Define a function to style the DataFrame
def style_summary_df(df):
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#eac086'),
                    ('color', 'black'),
                    ('font-weight', 'bold')]},
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#000000'),
                    ('color', 'white')]},
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%')]},
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]}]
    ).set_properties(**{'text-align': 'left'}).hide(axis='index')

# Apply styling to the DataFrame
styled_summary_df = style_summary_df(summary_df)

# Display the styled DataFrame
styled_summary_df
Out[4]:
Model Name Architecture Activation Dropout Optimizer Loss Function Early Stopping Epochs Batch Size
Feedforward Neural Network (FNN) 2 hidden layers (16, 8 units) ReLU None Adam Mean Squared Error Patience: 10, Min Delta: 0.0005 150 8
Deep Neural Network with Dropout 2 hidden layers (64, 32 units) with Dropout ReLU Dropout (0.3) Adam Mean Squared Error Patience: 10, Min Delta: 0.0005 150 8
In [22]:
# Create and train the FNN model
fnn_model = create_fnn_model(X_train.shape[1])
fnn_model.summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_3 (Dense)                      │ (None, 16)                  │             176 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_4 (Dense)                      │ (None, 8)                   │             136 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_5 (Dense)                      │ (None, 1)                   │               9 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 321 (1.25 KB)
 Trainable params: 321 (1.25 KB)
 Non-trainable params: 0 (0.00 B)
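The parameter counts in the summary follow directly from the Dense layer formula (inputs + 1) × units, where the +1 is the bias; 176 = (10 + 1) × 16 implies the engineered feature set here has 10 columns. A quick check (`dense_params` is a helper written for this sketch):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases for a fully connected layer."""
    return (n_inputs + 1) * n_units

input_dim = 10  # inferred from 176 = (10 + 1) * 16 in the summary above
layers = [(input_dim, 16), (16, 8), (8, 1)]
counts = [dense_params(i, u) for i, u in layers]
print(counts, sum(counts))  # [176, 136, 9] 321
```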
In [31]:
from tensorflow.keras.callbacks import Callback
from sklearn.metrics import r2_score

class R2ScoreCallback(Callback):
    def __init__(self, training_data, validation_data):
        super().__init__()
        self.training_data = training_data
        self.validation_data = validation_data

    def on_epoch_end(self, epoch, logs=None):
        # Calculate R² score on the training data
        X_train, y_train = self.training_data
        y_train_pred = self.model.predict(X_train, verbose=0)
        train_r2 = r2_score(y_train, y_train_pred)

        # Calculate R² score on the validation data
        X_val, y_val = self.validation_data
        y_val_pred = self.model.predict(X_val, verbose=0)
        val_r2 = r2_score(y_val, y_val_pred)

        # Log R² scores
        print(f"Epoch {epoch + 1}: Train R2 score = {train_r2:.4f}, Validation R2 score = {val_r2:.4f}")
In [88]:
# Create training and validation sets
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
In [34]:
# Initialize the custom R2 callback with the training and validation data
r2_callback = R2ScoreCallback(training_data=(X_train_new, y_train_new), validation_data=(X_val, y_val))


early_stopping = EarlyStopping(
    monitor='val_loss',                # Monitor the validation loss
    patience=10,                       # Number of epochs to wait for improvement (increased to allow more time for potential improvements)
    min_delta=0.0005,                  # Minimum change to qualify as an improvement (more sensitive to smaller changes)
    restore_best_weights=True,         # Restore model weights to the best observed during training
    verbose=1,                         # Provide detailed logs about early stopping events
    mode='min'                         # Mode for monitoring ('min' since we're monitoring loss which we want to minimize)
)
In [35]:
# Train the model with the custom callback
history_fnn = fnn_model.fit(
    X_train_new, y_train_new,
    epochs=150,
    batch_size=8,
    validation_data=(X_val, y_val),
    verbose=1,
    callbacks=[early_stopping, r2_callback]
)
Epoch 1/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0148 Epoch 1: Train R2 score = 0.9300, Validation R2 score = 0.9665
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - loss: 0.0166 - val_loss: 0.0160
Epoch 2/150
16/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0197  Epoch 2: Train R2 score = 0.9344, Validation R2 score = 0.9687
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0201 - val_loss: 0.0156
Epoch 3/150
18/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0151 Epoch 3: Train R2 score = 0.9320, Validation R2 score = 0.9678
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - loss: 0.0163 - val_loss: 0.0156
Epoch 4/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0188 Epoch 4: Train R2 score = 0.9313, Validation R2 score = 0.9664
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0187 - val_loss: 0.0159
Epoch 5/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0186 Epoch 5: Train R2 score = 0.9182, Validation R2 score = 0.9463
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0202 - val_loss: 0.0209
Epoch 6/150
18/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0168 Epoch 6: Train R2 score = 0.9342, Validation R2 score = 0.9666
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.0177 - val_loss: 0.0160
Epoch 7/150
18/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0166 Epoch 7: Train R2 score = 0.9311, Validation R2 score = 0.9684
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0172 - val_loss: 0.0154
Epoch 8/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0199 Epoch 8: Train R2 score = 0.9260, Validation R2 score = 0.9638
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0209 - val_loss: 0.0165
Epoch 9/150
16/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0168 Epoch 9: Train R2 score = 0.9329, Validation R2 score = 0.9682
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.0182 - val_loss: 0.0153
Epoch 10/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0225 Epoch 10: Train R2 score = 0.9319, Validation R2 score = 0.9675
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0225 - val_loss: 0.0153
Epoch 11/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0197 Epoch 11: Train R2 score = 0.9310, Validation R2 score = 0.9691
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0201 - val_loss: 0.0148
Epoch 12/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0146 Epoch 12: Train R2 score = 0.9337, Validation R2 score = 0.9704
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0163 - val_loss: 0.0147
Epoch 13/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0267 Epoch 13: Train R2 score = 0.9339, Validation R2 score = 0.9709
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 47ms/step - loss: 0.0245 - val_loss: 0.0144
Epoch 14/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0269 Epoch 14: Train R2 score = 0.9323, Validation R2 score = 0.9701
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0248 - val_loss: 0.0144
Epoch 15/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0394 Epoch 15: Train R2 score = 0.9317, Validation R2 score = 0.9675
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - loss: 0.0333 - val_loss: 0.0150
Epoch 16/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0219 Epoch 16: Train R2 score = 0.9336, Validation R2 score = 0.9704
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0214 - val_loss: 0.0142
Epoch 17/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0182  Epoch 17: Train R2 score = 0.9331, Validation R2 score = 0.9693
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - loss: 0.0188 - val_loss: 0.0148
Epoch 18/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0125 Epoch 18: Train R2 score = 0.9331, Validation R2 score = 0.9693
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0154 - val_loss: 0.0144
Epoch 19/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0156  Epoch 19: Train R2 score = 0.9326, Validation R2 score = 0.9672
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0169 - val_loss: 0.0147
Epoch 20/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0285 Epoch 20: Train R2 score = 0.9327, Validation R2 score = 0.9709
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0262 - val_loss: 0.0139
Epoch 21/150
18/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0190 Epoch 21: Train R2 score = 0.9309, Validation R2 score = 0.9677
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0193 - val_loss: 0.0146
Epoch 22/150
14/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0139  Epoch 22: Train R2 score = 0.9328, Validation R2 score = 0.9712
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - loss: 0.0158 - val_loss: 0.0137
Epoch 23/150
20/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0144 Epoch 23: Train R2 score = 0.9310, Validation R2 score = 0.9689
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0151 - val_loss: 0.0142
Epoch 24/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0171 Epoch 24: Train R2 score = 0.9331, Validation R2 score = 0.9679
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0180 - val_loss: 0.0143
Epoch 25/150
15/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 0.0119  Epoch 25: Train R2 score = 0.9313, Validation R2 score = 0.9698
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - loss: 0.0145 - val_loss: 0.0140
Epoch 26/150
17/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0191 Epoch 26: Train R2 score = 0.9301, Validation R2 score = 0.9661
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - loss: 0.0193 - val_loss: 0.0148
Epoch 26: early stopping
Restoring model weights from the end of the best epoch: 16.
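Early stopping halted the FNN at epoch 26 and restored the epoch-16 weights. One quick way to inspect this behavior is to plot the History object returned by `fit`; the sketch below uses a small stand-in dict in place of `history_fnn.history` (the values are a few of the per-epoch losses logged above).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Stand-in for history_fnn.history; in the notebook, use the real object,
# e.g. history = history_fnn.history
history = {
    "loss":     [0.0166, 0.0201, 0.0163, 0.0187, 0.0202],
    "val_loss": [0.0160, 0.0156, 0.0156, 0.0159, 0.0209],
}

plt.figure(figsize=(8, 4))
plt.plot(history["loss"], label="Training loss")
plt.plot(history["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("FNN training vs. validation loss")
plt.legend()
plt.tight_layout()
plt.savefig("fnn_loss_curves.png", dpi=100)
plt.close()
```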
In [37]:
# Create the DNN model with Dropout (named to match the later fit/evaluate calls)
dnn_model_with_dropout = create_dnn_model_with_dropout(X_train.shape[1])
dnn_model_with_dropout.summary()
Model: "sequential_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense_15 (Dense)                     │ (None, 64)                  │             704 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_6 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_16 (Dense)                     │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_7 (Dropout)                  │ (None, 32)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_17 (Dense)                     │ (None, 1)                   │              33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 2,817 (11.00 KB)
 Trainable params: 2,817 (11.00 KB)
 Non-trainable params: 0 (0.00 B)
In [ ]:
early_stopping = EarlyStopping(
    monitor='val_loss',                # Monitor the validation loss
    patience=10,                       # Number of epochs to wait for improvement (increased to allow more time for potential improvements)
    min_delta=0.0005,                  # Minimum change to qualify as an improvement (more sensitive to smaller changes)
    restore_best_weights=True,         # Restore model weights to the best observed during training
    verbose=1,                         # Provide detailed logs about early stopping events
    mode='min'                         # Mode for monitoring ('min' since we're monitoring loss which we want to minimize)
)

history_dnn = dnn_model_with_dropout.fit(X_train, y_train, epochs=100, batch_size=8, validation_split=0.1, verbose=1, callbacks=[early_stopping])
In [41]:
# Train the model with the custom callback
history_dnn = dnn_model_with_dropout.fit(
    X_train_new, y_train_new,
    epochs=150,
    batch_size=16,
    validation_data=(X_val, y_val),
    verbose=2,
    callbacks=[early_stopping, r2_callback]
)
Epoch 1/150
Epoch 1: Train R2 score = 0.8694, Validation R2 score = 0.7881
11/11 - 1s - 55ms/step - loss: 0.1402 - val_loss: 0.0408
Epoch 2/150
Epoch 2: Train R2 score = 0.8905, Validation R2 score = 0.8043
11/11 - 1s - 51ms/step - loss: 0.1570 - val_loss: 0.0377
Epoch 3/150
Epoch 3: Train R2 score = 0.8641, Validation R2 score = 0.7802
11/11 - 0s - 45ms/step - loss: 0.1172 - val_loss: 0.0423
Epoch 4/150
Epoch 4: Train R2 score = 0.8756, Validation R2 score = 0.7860
11/11 - 1s - 46ms/step - loss: 0.1352 - val_loss: 0.0412
Epoch 5/150
Epoch 5: Train R2 score = 0.8736, Validation R2 score = 0.7890
11/11 - 1s - 48ms/step - loss: 0.1200 - val_loss: 0.0406
Epoch 6/150
Epoch 6: Train R2 score = 0.8623, Validation R2 score = 0.7836
11/11 - 1s - 53ms/step - loss: 0.1190 - val_loss: 0.0417
Epoch 7/150
Epoch 7: Train R2 score = 0.8539, Validation R2 score = 0.7749
11/11 - 1s - 51ms/step - loss: 0.1390 - val_loss: 0.0433
Epoch 8/150
Epoch 8: Train R2 score = 0.8518, Validation R2 score = 0.7757
11/11 - 1s - 55ms/step - loss: 0.1660 - val_loss: 0.0432
Epoch 9/150
Epoch 9: Train R2 score = 0.8556, Validation R2 score = 0.7843
11/11 - 1s - 53ms/step - loss: 0.1588 - val_loss: 0.0415
Epoch 10/150
Epoch 10: Train R2 score = 0.8813, Validation R2 score = 0.8114
11/11 - 1s - 52ms/step - loss: 0.1425 - val_loss: 0.0363
Epoch 10: early stopping
Restoring model weights from the end of the best epoch: 1.
In [76]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def evaluate_regression_model(model, X_train, y_train, X_val, y_val):
    # Predict on training data
    y_train_pred = model.predict(X_train, verbose=0)
    
    # Calculate metrics on training data
    train_r2 = r2_score(y_train, y_train_pred)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_rmse = np.sqrt(train_mse)

    # Predict on validation data
    y_val_pred = model.predict(X_val, verbose=0)
    
    # Calculate metrics on validation data
    val_r2 = r2_score(y_val, y_val_pred)
    val_mae = mean_absolute_error(y_val, y_val_pred)
    val_mse = mean_squared_error(y_val, y_val_pred)
    val_rmse = np.sqrt(val_mse)

    # Collect metrics into a dictionary
    results = {
        'Model': model.__class__.__name__,
        'Training R² Score': train_r2,
        'Training MAE': train_mae,
        'Training MSE': train_mse,
        'Training RMSE': train_rmse,
        'Validation R² Score': val_r2,
        'Validation MAE': val_mae,
        'Validation MSE': val_mse,
        'Validation RMSE': val_rmse
    }

    return results

def combine_and_style_results(results1, results2):
    # Convert results dictionaries to DataFrames
    df1 = pd.DataFrame([results1])
    df2 = pd.DataFrame([results2])
    
    # Combine DataFrames
    combined_df = pd.concat([df1, df2], ignore_index=True)
    
    # Add model names for clarity
    combined_df.index = ['FNN', 'DNN']
    
    # Style the DataFrame
    styled_df = combined_df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#000000'),
                    ('color', 'white'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('font-size', '18px')]},  # Header styling
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#eac086'),
                    ('color', 'black'),
                    ('text-align', 'center'),
                    ('font-size', '14px')]},  # Data cell styling
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%'),
                    ('margin', '20px auto'),
                    ('border', '2px solid #000000')]},  # Table border
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},  # Even row color
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]},  # Odd row color
         {'selector': 'th, td',
          'props': [('border-right', '1px solid #000000')]},  # Vertical lines
         {'selector': 'th:first-child, td:first-child',
          'props': [('border-left', '1px solid #000000')]},  # Left border for first column
         {'selector': 'th:last-child, td:last-child',
          'props': [('border-right', '1px solid #000000')]},  # Right border for last column
    ]).set_properties(**{'text-align': 'center'}).hide(axis='index')

    return styled_df

# Example usage:
# Assuming X_train, y_train, X_val, y_val are already defined
fnn_results = evaluate_regression_model(fnn_model, X_train, y_train, X_val, y_val)
dnn_results = evaluate_regression_model(dnn_model_with_dropout, X_train, y_train, X_val, y_val)

# Combine and style results
styled_results_df = combine_and_style_results(fnn_results, dnn_results)

# Display the styled DataFrame
styled_results_df
Combined Evaluation Results:
Out[76]:
  Model Training R² Score Training MAE Training MSE Training RMSE Validation R² Score Validation MAE Validation MSE Validation RMSE
FNN Sequential 0.938242 0.040361 0.010024 0.100120 0.970382 0.035750 0.005704 0.075525
DNN Sequential 0.859452 0.118425 0.022813 0.151039 0.788094 0.158628 0.040810 0.202014
In [ ]:
 
In [87]:
import pandas as pd

def make_predictions(model, X_test):
    predictions = model.predict(X_test, verbose=0)
    return predictions

def compare_predictions(y_test, predictions1, predictions2, model_names):
    comparison_df = pd.DataFrame({
        'Actual': y_test,
        f'{model_names[0]} Prediction': predictions1.flatten(),
        f'{model_names[1]} Prediction': predictions2.flatten()
    })
    return comparison_df

def style_comparison_df(df):
    """
    Apply custom styles to the DataFrame with black background and white text for column headers,
    and enhanced styling for the rest of the table.
    """
    return df.style.set_table_styles(
        [{'selector': 'thead th',
          'props': [('background-color', '#000000'),
                    ('color', 'white'),
                    ('font-weight', 'bold'),
                    ('text-align', 'center'),
                    ('font-size', '18px')]},  # Increased font size to 18px for headers
         {'selector': 'td',
          'props': [('padding', '10px'),
                    ('background-color', '#eac086'),
                    ('color', 'black'),
                    ('text-align', 'center'),
                    ('font-size', '14px')]},  # Increased font size to 14px for data cells
         {'selector': 'table',
          'props': [('border-collapse', 'collapse'),
                    ('width', '80%'),
                    ('margin', '20px auto'),
                    ('border', '2px solid #000000')]},  # Table border settings
         {'selector': 'tr:nth-of-type(even)',
          'props': [('background-color', '#f9f9f9')]},  # Even row color
         {'selector': 'tr:nth-of-type(odd)',
          'props': [('background-color', '#ffffff')]},  # Odd row color
         {'selector': 'th, td',
          'props': [('border-right', '1px solid #000000')]},  # Vertical lines
         {'selector': 'th:first-child, td:first-child',
          'props': [('border-left', '1px solid #000000')]},  # Left border for first column
         {'selector': 'th:last-child, td:last-child',
          'props': [('border-right', '1px solid #000000')]},  # Right border for last column
    ]).set_properties(**{'text-align': 'center'}).hide(axis='index')

# Assuming X_test and y_test are already defined and both models are trained
fnn_predictions = make_predictions(fnn_model, X_test)
dnn_predictions = make_predictions(dnn_model_with_dropout, X_test)

# Compare predictions
model_names = ['FNN Model', 'DNN Model with Dropout']
comparison_df = compare_predictions(y_test, fnn_predictions, dnn_predictions, model_names)

# Apply enhanced styling to the DataFrame
styled_comparison_df = style_comparison_df(comparison_df)

# Display the styled DataFrame
styled_comparison_df
Out[87]:
Actual FNN Model Prediction DNN Model with Dropout Prediction
1.740466 2.096180 2.211203
2.667228 2.634672 2.609906
3.186353 3.167078 3.030328
2.928524 2.890035 2.755300
3.077312 3.069199 2.937331
2.602690 2.580199 2.546653
3.374169 3.367466 3.265251
2.351375 2.387134 2.407455
2.208274 2.291594 2.302597
3.303217 3.297326 3.251602
2.151762 2.261909 2.315779
2.251292 2.317910 2.388782
3.397858 3.392564 3.265052
2.714695 2.673878 2.570694
3.000720 2.974777 2.818685
3.610918 3.672984 3.613905
3.314186 3.302397 3.181728
3.433987 3.441802 3.263703
2.856470 2.814801 2.817613
3.206803 3.186814 3.043157
3.335770 3.322717 3.113991
3.049273 3.051577 2.984020
3.471966 3.490082 3.387378
2.624669 2.591963 2.535012
2.360854 2.391650 2.396556
2.533697 2.526640 2.630626
2.282382 2.342832 2.475944
2.980619 2.955317 2.861118
3.443618 3.453160 3.417614
2.433613 2.434926 2.376459
3.095578 3.081542 2.950192
2.674149 2.623085 2.492307
2.292535 2.341393 2.381365
2.884801 2.854247 2.844613
2.917771 2.890929 2.905307
3.020425 2.731265 2.661855
1.974081 2.176432 2.366163
3.459466 3.469054 3.292425
3.353407 3.340794 3.138596
3.549617 3.582895 3.410199
2.667228 2.628853 2.682498
3.325036 3.307521 3.147686
2.484907 2.481866 2.541354
3.214868 3.200366 3.080044
2.839078 2.788263 2.621739
2.351375 2.376062 2.459970
2.939162 2.903844 2.811548
3.526361 3.558652 3.429070
In [83]:
def plot_actual_vs_predicted(comparison_df, model_names):

    # Sample 5 random indices
    sample_indices = np.random.choice(comparison_df.index, size=5, replace=False)
    
    # Create a subset DataFrame for plotting
    sample_df = comparison_df.loc[sample_indices]
    
    plt.figure(figsize=(16, 8))

    # Plot for Model 1
    plt.subplot(1, 2, 1)
    plt.scatter(sample_df['Actual'], sample_df[f'{model_names[0]} Prediction'], 
                color='#eac086', alpha=0.7, edgecolors='black', label='Predictions')
    plt.plot([sample_df['Actual'].min(), sample_df['Actual'].max()],
             [sample_df['Actual'].min(), sample_df['Actual'].max()], 'k--', lw=2, label='Perfect Fit')
    plt.xlabel('Actual Body Fat Percentage', fontsize=12)
    plt.ylabel(f'{model_names[0]} Predictions', fontsize=12)
    plt.title(f'{model_names[0]}: Actual vs Predicted (Sampled)', fontsize=14)
    plt.legend()
    
    # Add value labels for sampled points
    for i in range(len(sample_df)):
        plt.text(sample_df['Actual'].iloc[i], sample_df[f'{model_names[0]} Prediction'].iloc[i], 
                 f'{sample_df[f"{model_names[0]} Prediction"].iloc[i]:.2f}', fontsize=8, color='black')

    # Plot for Model 2
    plt.subplot(1, 2, 2)
    plt.scatter(sample_df['Actual'], sample_df[f'{model_names[1]} Prediction'], 
                color='#eac086', alpha=0.7, edgecolors='black', label='Predictions')
    plt.plot([sample_df['Actual'].min(), sample_df['Actual'].max()],
             [sample_df['Actual'].min(), sample_df['Actual'].max()], 'k--', lw=2, label='Perfect Fit')
    plt.xlabel('Actual Body Fat Percentage', fontsize=12)
    plt.ylabel(f'{model_names[1]} Predictions', fontsize=12)
    plt.title(f'{model_names[1]}: Actual vs Predicted (Sampled)', fontsize=14)
    plt.legend()
    
    # Add value labels for sampled points
    for i in range(len(sample_df)):
        plt.text(sample_df['Actual'].iloc[i], sample_df[f'{model_names[1]} Prediction'].iloc[i], 
                 f'{sample_df[f"{model_names[1]} Prediction"].iloc[i]:.2f}', fontsize=8, color='black')

    plt.tight_layout()
    plt.show()

# Plot the comparison with sampled values
plot_actual_vs_predicted(comparison_df, model_names)
(Figure: side-by-side scatter plots of actual vs. predicted body fat for the five sampled test points, FNN on the left and DNN with Dropout on the right.)

🏆 Best DL Model: FNN Model 🏆¶

The FNN Model (Feedforward Neural Network) is declared the winner over the DNN Model with Dropout based on the following performance metrics:

  • Higher Validation R² Score: The FNN model achieved a Validation R² Score of 0.970, significantly higher than the DNN model's score of 0.788. This indicates that the FNN model better explains the variance in the validation data.
  • Lower Validation MAE and RMSE: The FNN model has a lower Validation Mean Absolute Error (MAE) of 0.0357 and a Validation Root Mean Squared Error (RMSE) of 0.0755 compared to the DNN model's 0.1586 (MAE) and 0.2020 (RMSE), suggesting more accurate predictions and less deviation from actual values.
  • Better Generalization: The FNN model's validation scores closely track its training scores, implying it generalizes well to new data, whereas the DNN model's training R² (0.859) sits well above its validation R² (0.788), a sign of weaker generalization.

Overall, the FNN Model outperforms the DNN Model with Dropout in terms of both accuracy and consistency, making it the superior choice for this problem.
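The generalization argument can be made concrete by computing the train-minus-validation R² gap from the scores in the evaluation table above (the numbers below are copied from that table; a positive gap means the model does worse on unseen data):

```python
# Generalization gap: train R² minus validation R², using the scores
# reported in the combined evaluation table.
scores = {
    "FNN": {"train_r2": 0.938242, "val_r2": 0.970382},
    "DNN": {"train_r2": 0.859452, "val_r2": 0.788094},
}

gaps = {name: s["train_r2"] - s["val_r2"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: R² gap (train - val) = {gap:+.4f}")
```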


👉 | Saving Model
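Since this section promises model saving, here is a minimal persistence sketch. Any scikit-learn-style estimator, including XGBRegressor, can be saved with joblib; the example below trains a GradientBoostingRegressor on synthetic stand-in data so it runs without the fitted notebook objects. (XGBoost additionally offers its own `model.save_model` format.)

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the notebook's fitted model and data; in the
# notebook, joblib.dump the winning fitted estimator directly.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

model = GradientBoostingRegressor(random_state=42).fit(X, y)
joblib.dump(model, "bodyfat_model.joblib")      # save to disk

reloaded = joblib.load("bodyfat_model.joblib")  # load it back
assert np.allclose(model.predict(X), reloaded.predict(X))
```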

6 | Conclusion¶

🏆 Model Rankings 🏆¶

After comparing all five models, here are the top three models based on overall performance metrics:

1st Place: XGBRegressor: The Supreme One¶

The XGBRegressor emerges as the best model among all due to its excellent balance between accuracy, minimal error rates, and strong variance explanation:

  • Highest R² Score: 0.9898, meaning it captures nearly 99% of the variance in the target variable.
  • Lowest RMSE: 0.0472, indicating minimal prediction errors.
  • Very Low MAE and MAPE: An MAE of 0.0229 and MAPE of 0.00895 show its superior prediction accuracy.
  • High Robustness: With a Median AE of 0.0096 and the highest Explained Variance Score of 0.9899, it is the most reliable model overall.
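All of the metrics cited above are available directly in `sklearn.metrics`; a self-contained sketch, with illustrative arrays standing in for `y_test` and the XGBRegressor predictions:

```python
import numpy as np
from sklearn.metrics import (
    r2_score, mean_absolute_error, mean_squared_error,
    mean_absolute_percentage_error, median_absolute_error,
    explained_variance_score,
)

# Illustrative arrays; in the notebook these would be y_test and
# the model's predictions on X_test.
y_true = np.array([2.5, 3.0, 2.0, 3.5, 2.8])
y_pred = np.array([2.6, 2.9, 2.1, 3.4, 2.8])

print("R²:                ", r2_score(y_true, y_pred))
print("RMSE:              ", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:               ", mean_absolute_error(y_true, y_pred))
print("MAPE:              ", mean_absolute_percentage_error(y_true, y_pred))
print("Median AE:         ", median_absolute_error(y_true, y_pred))
print("Explained variance:", explained_variance_score(y_true, y_pred))
```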

Conclusion for 1st Place:¶

The XGBRegressor is the supreme model due to its outstanding predictive performance, robustness, and minimal error rates, making it the most reliable choice for this task.


2nd Place: GradientBoostingRegressor¶

The GradientBoostingRegressor closely follows as the second-best model with its strong performance across various metrics:

  • High R² Score: 0.9879, just slightly below the XGBRegressor.
  • Low RMSE: 0.0515, only marginally higher than the XGBRegressor.
  • Lowest MAE: 0.0224, indicating the smallest average absolute error among all models.
  • Exceptional Stability: With a Median AE of 0.0046, it even outperforms the XGBRegressor in this metric.
  • High Explained Variance Score: 0.9879, confirming its strong ability to explain data variance.

Conclusion for 2nd Place:¶

The GradientBoostingRegressor is an excellent alternative to the XGBRegressor, with slightly lower performance in a few areas but still delivering highly competitive results.


3rd Place: FNN Model (Feedforward Neural Network)¶

The FNN Model secures the third position due to its solid generalization capabilities and competitive performance:

  • High Validation R² Score: 0.970, indicating good accuracy and ability to explain variance.
  • Lower Validation Errors: A Validation MAE of 0.0357 and RMSE of 0.0755 show it performs well, though not as strongly as the top two models.
  • Balanced Generalization: The FNN Model has a tight alignment between training and validation scores, suggesting it generalizes well without overfitting.

Conclusion for 3rd Place:¶

The FNN Model is a strong contender with reliable accuracy and low error rates, though it falls short of the top-tier performance of the XGBRegressor and GradientBoostingRegressor.


Final Rankings:¶

  1. XGBRegressor — Supreme model with the best overall performance.
  2. GradientBoostingRegressor — Close competitor with nearly equal results.
  3. FNN Model — A robust model with good accuracy and generalization.


7 | Future Aspects¶

  • Try hyperparameter tuning of the models.
  • Deploy the best-performing model.
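As a starting point for the tuning idea, here is a minimal GridSearchCV sketch over common boosting hyperparameters. It uses GradientBoostingRegressor on synthetic stand-in data so it is self-contained; the same estimator API (and the same pattern) applies to XGBRegressor, and the grid values are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook use X_train, y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Small illustrative grid; expand as compute budget allows.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid, cv=3, scoring="r2", n_jobs=-1,
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV R²: ", round(search.best_score_, 4))
```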